This capstone project is an opportunity for you to analyze a dataset and build predictive models that can provide insights to the Human Resources (HR) department of a large consulting firm.
Upon completion, you will have two artifacts to present to future employers. One is a brief one-page summary of this project that you would present to external stakeholders as the data professional at Salifort Motors. The other is the complete code notebook provided here. Drawing on your prior course work, select one way to answer the project question: use either a regression model or a machine learning model to predict whether or not an employee will leave the company. The exemplar following this activity shows both approaches, but you only need to do one.
In your deliverables, you will include the model evaluation (and interpretation if applicable), a data visualization(s) of your choice that is directly related to the question you ask, ethical considerations, and the resources you used to troubleshoot and find answers or solutions.
First, let's clarify:
You're predicting whether an employee will leave the company → this is a binary classification task, not regression.
So by "regression model," the question probably refers to logistic regression, which is commonly used for binary outcomes.
When to Use Logistic Regression (a type of regression model):
| Use logistic regression when... | Why |
|---|---|
| You need interpretability | Logistic regression gives you clear coefficients showing how each feature influences the outcome. Good for HR explaining "why" someone might leave. |
| Your data is relatively small and clean | It performs well with smaller datasets and is robust to overfitting. |
| The relationship between variables is mostly linear | It assumes a linear relationship between features and the log-odds of the outcome. |
| You want a fast baseline | It’s quick to implement and often a good first model to test. |
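The interpretability point above is easiest to see in code. Below is a minimal sketch, using synthetic data (a stand-in for the HR dataset) and illustrative feature names, showing how logistic regression coefficients convert to odds ratios that can be explained to stakeholders:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the HR data: 3 features, binary "left" outcome
X, y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)

model = LogisticRegression(max_iter=500).fit(X, y)

# Each coefficient is the change in log-odds per one-unit increase in a feature;
# exponentiating gives an odds ratio, which is easier to explain to stakeholders
odds_ratios = np.exp(model.coef_[0])
for name, orat in zip(['feature_1', 'feature_2', 'feature_3'], odds_ratios):
    print(f"{name}: odds ratio = {orat:.2f}")
```

An odds ratio above 1 means the feature pushes toward leaving; below 1, toward staying.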
When to Use Machine Learning Models (like Random Forest, XGBoost, Neural Nets):
| Use ML models when... | Why |
|---|---|
| You care more about predictive accuracy | ML models often outperform logistic regression in terms of raw prediction accuracy, especially on complex data. |
| You have complex, nonlinear relationships | Algorithms like random forest or gradient boosting handle non-linearities and interactions better. |
| You have a lot of features or high-dimensional data | ML models can manage large feature sets and find hidden patterns. |
| You have enough data | Many ML models perform better with large datasets. |
Summary Table
| Scenario | Action |
|---|---|
| Small, interpretable dataset | Logistic Regression |
| Large, complex dataset | ML Model (Random Forest, XGBoost, etc.) |
| Need explainable predictions for business stakeholders | Logistic Regression or Explainable ML |
| Focused on maximizing prediction accuracy | ML Model |
Tip:
In practice, try both:
Start with logistic regression as a baseline.
Then try machine learning models to see if performance improves.
Use cross-validation and metrics like accuracy, precision, recall, F1-score, AUC to compare.
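The baseline-then-compare workflow above can be sketched as follows. This is a minimal illustration on synthetic data (the real lab would use the HR dataframe's features and the `left` column instead):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in dataset
X, y = make_classification(n_samples=1500, n_features=8, n_informative=5,
                           random_state=42)

scores = {}
for name, model in [('logistic_regression', LogisticRegression(max_iter=1000)),
                    ('random_forest', RandomForestClassifier(random_state=42))]:
    # 5-fold cross-validated AUC keeps the comparison fair across models
    auc = cross_val_score(model, X, y, cv=5, scoring='roc_auc').mean()
    scores[name] = auc
    print(f"{name}: mean AUC = {auc:.3f}")
```

The same loop extends to precision, recall, or F1 by changing the `scoring` argument.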
Consider the questions in your PACE Strategy Document to reflect on the Plan stage.
In this stage, consider the following:
The HR department at Salifort Motors wants to take some initiatives to improve employee satisfaction levels at the company. They collected data from employees, but now they don’t know what to do with it. They refer to you as a data analytics professional and ask you to provide data-driven suggestions based on your understanding of the data. They have the following question: what’s likely to make the employee leave the company?
Your goals in this project are to analyze the data collected by the HR department and to build a model that predicts whether or not an employee will leave the company.
If you can predict employees likely to quit, it might be possible to identify factors that contribute to their leaving. Because it is time-consuming and expensive to find, interview, and hire new employees, increasing employee retention will be beneficial to the company.
The dataset that you'll be using in this lab contains 15,000 rows and 10 columns for the variables listed below.
Note: you don't need to download any data to complete this lab. For more information about the data, refer to its source on Kaggle.
| Variable | Description |
|---|---|
| satisfaction_level | Employee-reported job satisfaction level [0–1] |
| last_evaluation | Score of employee's last performance review [0–1] |
| number_project | Number of projects the employee contributes to |
| average_monthly_hours | Average number of hours the employee worked per month |
| time_spend_company | How long the employee has been with the company (years) |
| Work_accident | Whether or not the employee experienced an accident while at work |
| left | Whether or not the employee left the company |
| promotion_last_5years | Whether or not the employee was promoted in the last 5 years |
| Department | The employee's department |
| salary | The employee's salary category (low, medium, high) |
# For data manipulation
import numpy as np
import pandas as pd
# For data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# For displaying all of the columns in dataframes
pd.set_option('display.max_columns', None)
# Import packages for statistical analysis/hypothesis testing
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
from scipy.stats import chi2_contingency
# Import packages for data preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils import resample
# For data modeling
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
from xgboost import plot_importance
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# For metrics and helpful functions
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score,\
f1_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.tree import plot_tree
# For saving models
import pickle
import os
Pandas is used to read a dataset called HR_capstone_dataset_Salifort.csv. As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file or provide more code in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.
# RUN THIS CELL TO IMPORT YOUR DATA.
# Load dataset into a dataframe
df0 = pd.read_csv("HR_capstone_dataset_Salifort.csv")
df = df0.copy()
# Display first few rows of the dataframe
df.head(10)
| satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | Department | salary | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |
| 5 | 0.41 | 0.50 | 2 | 153 | 3 | 0 | 1 | 0 | sales | low |
| 6 | 0.10 | 0.77 | 6 | 247 | 4 | 0 | 1 | 0 | sales | low |
| 7 | 0.92 | 0.85 | 5 | 259 | 5 | 0 | 1 | 0 | sales | low |
| 8 | 0.89 | 1.00 | 5 | 224 | 5 | 0 | 1 | 0 | sales | low |
| 9 | 0.42 | 0.53 | 2 | 142 | 3 | 0 | 1 | 0 | sales | low |
# Gather basic information about the data
df.shape
(14999, 10)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   satisfaction_level     14999 non-null  float64
 1   last_evaluation        14999 non-null  float64
 2   number_project         14999 non-null  int64
 3   average_montly_hours   14999 non-null  int64
 4   time_spend_company     14999 non-null  int64
 5   Work_accident          14999 non-null  int64
 6   left                   14999 non-null  int64
 7   promotion_last_5years  14999 non-null  int64
 8   Department             14999 non-null  object
 9   salary                 14999 non-null  object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB
# Gather descriptive statistics about the data
df.describe()
| satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | |
|---|---|---|---|---|---|---|---|---|
| count | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 |
| mean | 0.612834 | 0.716102 | 3.803054 | 201.050337 | 3.498233 | 0.144610 | 0.238083 | 0.021268 |
| std | 0.248631 | 0.171169 | 1.232592 | 49.943099 | 1.460136 | 0.351719 | 0.425924 | 0.144281 |
| min | 0.090000 | 0.360000 | 2.000000 | 96.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.440000 | 0.560000 | 3.000000 | 156.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.640000 | 0.720000 | 4.000000 | 200.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 0.820000 | 0.870000 | 5.000000 | 245.000000 | 4.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 1.000000 | 1.000000 | 7.000000 | 310.000000 | 10.000000 | 1.000000 | 1.000000 | 1.000000 |
Satisfaction Level: The mean is 0.6128 and the median is 0.64; the two are close, indicating only mild skew.
Average Monthly Hours: The distribution appears balanced, though some individuals work as few as 96 hours per month while others work up to 310.
Time Spent at the Company: The mean is 3.5 years, with a minimum of 2 years and a maximum of 10 years of tenure.
Left: This variable requires further analysis as it is the main outcome of interest, directly tied to employee retention.
Promotion in Last 5 Years: The mean of 0.0213 indicates that only about 2% of employees received a promotion in the past five years.
Department and Salary: Both could also be significant factors influencing employee retention and merit further exploration.
# Display all column names
df.columns
Index(['satisfaction_level', 'last_evaluation', 'number_project',
'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
'promotion_last_5years', 'Department', 'salary'],
dtype='object')
# Rename columns as needed
df = df.rename(columns={
'average_montly_hours': 'average_monthly_hours',
'Work_accident': 'work_accident',
'time_spend_company': 'tenure',
'Department': 'department'
})
Check for any missing values in the data.
# Check for missing values
df.isna().sum()
satisfaction_level       0
last_evaluation          0
number_project           0
average_monthly_hours    0
tenure                   0
work_accident            0
left                     0
promotion_last_5years    0
department               0
salary                   0
dtype: int64
Check for any duplicate entries in the data.
# Check for duplicates
df.duplicated().sum()
np.int64(3008)
# Inspect some rows containing duplicates as needed
df[df.duplicated()]
| satisfaction_level | last_evaluation | number_project | average_monthly_hours | tenure | work_accident | left | promotion_last_5years | department | salary | |
|---|---|---|---|---|---|---|---|---|---|---|
| 396 | 0.46 | 0.57 | 2 | 139 | 3 | 0 | 1 | 0 | sales | low |
| 866 | 0.41 | 0.46 | 2 | 128 | 3 | 0 | 1 | 0 | accounting | low |
| 1317 | 0.37 | 0.51 | 2 | 127 | 3 | 0 | 1 | 0 | sales | medium |
| 1368 | 0.41 | 0.52 | 2 | 132 | 3 | 0 | 1 | 0 | RandD | low |
| 1461 | 0.42 | 0.53 | 2 | 142 | 3 | 0 | 1 | 0 | sales | low |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14994 | 0.40 | 0.57 | 2 | 151 | 3 | 0 | 1 | 0 | support | low |
| 14995 | 0.37 | 0.48 | 2 | 160 | 3 | 0 | 1 | 0 | support | low |
| 14996 | 0.37 | 0.53 | 2 | 143 | 3 | 0 | 1 | 0 | support | low |
| 14997 | 0.11 | 0.96 | 6 | 280 | 4 | 0 | 1 | 0 | support | low |
| 14998 | 0.37 | 0.52 | 2 | 158 | 3 | 0 | 1 | 0 | support | low |
3008 rows × 10 columns
# Drop duplicates and save resulting dataframe in a new variable as needed
# - 'first': Keeps the first occurrence of each duplicate and removes the rest
df = df.drop_duplicates(keep='first')
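As a quick sanity check after dropping duplicates, it can help to confirm that none remain and count how many rows were removed. A minimal sketch on a toy frame (`df_demo` is a stand-in for `df`):

```python
import pandas as pd

# Toy frame with one exact duplicate row, standing in for df
df_demo = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})

before = len(df_demo)
df_demo = df_demo.drop_duplicates(keep='first')

# After dropping, no duplicate rows should remain
print(before - len(df_demo), "duplicate row(s) removed")
print("remaining duplicates:", df_demo.duplicated().sum())
```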
When looking at satisfaction level, the data shows a mean of 0.6128 and a median of 0.64, suggesting only mild skew and that most employees' satisfaction is close to the average. The standard deviation is 0.249, and the range is wide at 0.91 (from 0.09 to 1.00), which indicates that while many employees cluster near the mean, some are either very satisfied or very dissatisfied. This variation in satisfaction levels is worth exploring, as it could directly relate to why some people stay while others leave.
For average monthly hours, the data doesn’t show significant skew, but there’s a notable spread — some employees work as few as 96 hours a month, while others work up to 310. This suggests differences in workload that might affect stress levels and, ultimately, retention.
Regarding time spent at the company, the mean tenure is 3.5 years, with some employees having been there for as little as 2 years and others for as long as 10. This indicates a mix of new and long-serving staff that might have different perspectives on satisfaction and retention.
The variable left definitely needs closer examination. Since the goal is to understand and improve employee retention, analyzing why people leave will be crucial.
The promotion data is telling: only about 2% of employees received a promotion in the last five years. This might reflect limited opportunities for growth, which can negatively impact retention.
Lastly, department and salary are also factors that could play a significant role in employee turnover and should be analyzed alongside the other variables to get a comprehensive view of what’s driving people to stay or leave.
# 1. Welch's t-test (satisfaction_level ~ left), not assuming equal variances
group_left = df[df['left'] == 1]['satisfaction_level']
group_stayed = df[df['left'] == 0]['satisfaction_level']
ttest, p_value = stats.ttest_ind(group_left, group_stayed, equal_var=False)
print(f"T-test = {ttest:.4f}, p-value = {p_value:.6f}\n")
T-test = -35.8893, p-value = 0.000000
df_ttest = df.copy()
t_test_metric = df_ttest.drop(columns=['left', 'department', 'salary'])
metrics = {}
for metric in t_test_metric.columns:
group_left = df_ttest[df_ttest['left'] == 1][metric]
group_stayed = df_ttest[df_ttest['left'] == 0][metric]
# Convert to numeric in case of hidden strings
group_left = pd.to_numeric(group_left, errors='coerce')
group_stayed = pd.to_numeric(group_stayed, errors='coerce')
ttest, p_value = stats.ttest_ind(group_left, group_stayed, equal_var=False)
metrics[metric] = {'t_test': ttest, 'p_value': p_value}
# Convert dictionary to DataFrame
metrics_df = pd.DataFrame.from_dict(metrics, orient='index')
# Reset index to turn metric names into a column
metrics_df = metrics_df.reset_index().rename(columns={'index': 'metric'})
# Sort by p-value
metrics_df = metrics_df.sort_values(by='p_value')
# Optional: round values for readability
metrics_df = metrics_df.round({'t_test': 3})
metrics_df
| metric | t_test | p_value | |
|---|---|---|---|
| 0 | satisfaction_level | -35.889 | 1.193954e-228 |
| 4 | tenure | 24.050 | 5.787450e-119 |
| 5 | work_accident | -19.371 | 1.914951e-80 |
| 6 | promotion_last_5years | -7.816 | 6.311454e-15 |
| 3 | average_monthly_hours | 6.369 | 2.267947e-10 |
| 2 | number_project | 2.308 | 2.110164e-02 |
| 1 | last_evaluation | 1.298 | 1.943907e-01 |
# Create a contingency table
dept_contingency = pd.crosstab(df['department'], df['left'])
# Run chi-squared test
chi2_dept, p_dept, _, _ = chi2_contingency(dept_contingency)
print(f"Chi-squared test for department vs left: p-value = {p_dept}")
Chi-squared test for department vs left: p-value = 0.01329832963300122
salary_contingency = pd.crosstab(df['salary'], df['left'])
chi2_salary, p_salary, _, _ = chi2_contingency(salary_contingency)
print(f"Chi-squared test for salary vs left: p-value = {p_salary}")
Chi-squared test for salary vs left: p-value = 8.984123357404531e-39
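A tiny p-value says an association exists but not how strong it is. A common complement to the chi-squared test is Cramér's V, an effect size between 0 (no association) and 1 (perfect association). A minimal sketch, with `cramers_v` being a helper defined here (not a library function) and a toy table standing in for the salary-vs-left crosstab:

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(contingency: np.ndarray) -> float:
    """Cramér's V effect size for a chi-squared test of independence."""
    chi2, _, _, _ = chi2_contingency(contingency)
    n = contingency.sum()
    r, k = contingency.shape
    return float(np.sqrt(chi2 / (n * (min(r, k) - 1))))

# Toy 3x2 table standing in for pd.crosstab(df['salary'], df['left'])
table = np.array([[900, 300],
                  [800, 150],
                  [400,  30]])
print(f"Cramér's V = {cramers_v(table):.3f}")
```

Values near 0.1 are usually read as weak association, near 0.3 moderate, and 0.5+ strong.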
metrics_df = pd.concat([
metrics_df,
pd.DataFrame([
{'metric': 'department (chi2)', 't_test': None, 'p_value': p_dept},
{'metric': 'salary (chi2)', 't_test': None, 'p_value': p_salary}
])
])
# Sort again by p-value
metrics_df = metrics_df.sort_values(by='p_value').reset_index(drop=True)
print(metrics_df)
                  metric   t_test        p_value
0     satisfaction_level  -35.889  1.193954e-228
1                 tenure   24.050  5.787450e-119
2          work_accident  -19.371   1.914951e-80
3          salary (chi2)      NaN   8.984123e-39
4  promotion_last_5years   -7.816   6.311454e-15
5  average_monthly_hours    6.369   2.267947e-10
6      department (chi2)      NaN   1.329833e-02
7         number_project    2.308   2.110164e-02
8        last_evaluation    1.298   1.943907e-01
When analyzing the differences between employees who stayed and those who left, several key metrics show strong statistical significance, highlighting factors that could be influencing turnover.
| Metric | Test Type | Test Statistic | p-value | Interpretation |
|---|---|---|---|---|
| Satisfaction level | t-test | -35.889 | 1.19e-228 | Strong difference in satisfaction between employees who leave vs. stay; critical for retention. |
| Tenure | t-test | 24.050 | 5.79e-119 | Time at company strongly related to turnover; major factor in retention. |
| Work accident | t-test | -19.371 | 1.91e-80 | Significant difference in accident status; may relate to safety culture and working conditions. |
| Salary | chi-square | — | 8.98e-39 | Salary bands strongly associated with turnover; important for further exploration. |
| Promotion last 5 years | t-test | -7.816 | 6.31e-15 | Lack of promotions linked to departures; significant factor. |
| Average monthly hours | t-test | 6.369 | 2.27e-10 | Differences in working hours affect turnover, though effect size smaller than others. |
| Department | chi-square | — | 0.013 | Departmental differences impact retention; workload or environment may influence turnover. |
| Number of projects | t-test | 2.308 | 0.021 | Small but significant difference between groups. |
| Last evaluation | t-test | 1.298 | 0.194 | No significant difference; performance scores less predictive of turnover here. |
Overall, satisfaction level, tenure, work accident, salary, promotion, and departmental differences emerge as key areas to address in order to improve employee retention.
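Since nine hypothesis tests were run on the same dataset, some borderline p-values (department, number of projects) could be false positives. A multiple-testing correction such as Holm's method is one way to check; a minimal sketch using the p-values from the table above:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# p-values from the comparison table above, in ascending order
p_values = np.array([1.19e-228, 5.79e-119, 1.91e-80, 8.98e-39,
                     6.31e-15, 2.27e-10, 1.33e-2, 2.11e-2, 1.94e-1])

# Holm's step-down method controls the family-wise error rate at alpha = 0.05
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method='holm')
for p, r in zip(p_values, reject):
    print(f"p = {p:.2e} -> still significant after correction: {r}")
```

Here all results except last_evaluation survive the correction, so the conclusions above are robust to the number of tests performed.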
metrics_df['log_p_value'] = -np.log10(metrics_df['p_value'])
# Plot
sns.barplot(data=metrics_df, x='metric', y='log_p_value')
plt.xticks(rotation=45, ha='right')
plt.ylabel('-log10(p-value)')
plt.title('-log10(p-value) by Metric')
plt.tight_layout()
plt.show()
# Resignation rate by salary
fig, ax = plt.subplots(1, 2, figsize=(15, 6))
sns.barplot(x='salary', y='left', data=df, ax=ax[0])
ax[0].set_title("Turnover Rate by Salary")
# Resignation rate by department
sns.barplot(x='department', y='left', data=df, ax=ax[1])
ax[1].tick_params(axis='x', rotation=45)
ax[1].set_title("Turnover Rate by Department")
plt.tight_layout()
plt.show()
fig, ax = plt.subplots(1, 2, figsize = (22,8))
# Create boxplot showing `average_monthly_hours` distributions for `number_project`, comparing employees who stayed versus those who left
sns.boxplot(data=df, x='average_monthly_hours', y='number_project', hue='left', orient="h", ax=ax[0])
ax[0].invert_yaxis()
ax[0].set_title('Monthly hours by number of projects', fontsize='14')
# Create histogram showing distribution of `number_project`, comparing employees who stayed versus those who left
sns.histplot(data=df, x='number_project', hue='left', multiple='dodge', shrink=2, ax=ax[1])
ax[1].set_title('Number of projects histogram', fontsize='14')
# Display the plots
plt.show()
Normal working hours per month:
50 weeks * 40 hours per week / 12 months = 166.67 hours per month

# Set figure and axes
fig, ax = plt.subplots(1, 2, figsize=(22, 8))
# Define short-tenured employees
tenure_short = df[df['tenure'] < 7]
# Define long-tenured employees
tenure_long = df[df['tenure'] > 6]
# Plot short-tenured histogram
sns.histplot(data=tenure_short, x='tenure', hue='salary', discrete=1,
hue_order=['low', 'medium', 'high'], multiple='dodge', shrink=.5, ax=ax[0])
ax[0].set_title('Salary histogram by tenure: short-tenured people', fontsize='14')
# Plot long-tenured histogram
sns.histplot(data=tenure_long, x='tenure', hue='salary', discrete=1,
hue_order=['low', 'medium', 'high'], multiple='dodge', shrink=.4, ax=ax[1])
ax[1].set_title('Salary histogram by tenure: long-tenured people', fontsize='14')
Text(0.5, 1.0, 'Salary histogram by tenure: long-tenured people')
Check for outliers in the data.
# Take only numerical metrics
metrics = df.select_dtypes(include='number').columns
# Take off binary metrics from numerical metrics
metrics_non_binary = [col for col in metrics if df[col].nunique() > 2]
for metric in metrics_non_binary:
# Determine the number of rows containing outliers
perc75 = df[metric].quantile(0.75)
perc25 = df[metric].quantile(0.25)
iqr = perc75 - perc25
upper_limit = perc75 + 1.5 * iqr
lower_limit = perc25 - 1.5 * iqr
outliers = df[(df[metric] > upper_limit) | (df[metric] < lower_limit)]
print(f"\nMetric: {metric}")
print("Lower limit:", lower_limit)
print("Upper limit:", upper_limit)
print(f"Number of outliers in {metric}: {len(outliers)}")
Metric: satisfaction_level
Lower limit: -0.030000000000000027
Upper limit: 1.33
Number of outliers in satisfaction_level: 0

Metric: last_evaluation
Lower limit: 0.1349999999999999
Upper limit: 1.295
Number of outliers in last_evaluation: 0

Metric: number_project
Lower limit: 0.0
Upper limit: 8.0
Number of outliers in number_project: 0

Metric: average_monthly_hours
Lower limit: 28.0
Upper limit: 372.0
Number of outliers in average_monthly_hours: 0

Metric: tenure
Lower limit: 1.5
Upper limit: 5.5
Number of outliers in tenure: 824
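The IQR logic in the loop above can be factored into a small reusable helper that returns a boolean mask, making it easy to filter or count outliers per column. A minimal sketch; `iqr_outlier_mask` is a name introduced here, not a library function:

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy check: 100 is far outside the IQR fence of the other values
s = pd.Series([2, 3, 3, 3, 4, 4, 5, 100])
mask = iqr_outlier_mask(s)
print(mask.sum(), "outlier(s) found")
```

On the real data, `iqr_outlier_mask(df['tenure']).sum()` would reproduce the 824 count reported above.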
Begin by understanding how many employees left and what percentage of all employees this figure represents.
# Get numbers of people who left vs. stayed
status_mapping = {
0:'Stayed',
1:'Left'
}
print("\nNumber of People Who Left vs. Stayed:")
print(df['left'].map(status_mapping).value_counts().astype(str))
# Get percentages of people who left vs. stayed
print("\nPercentage of People Who Left vs. Stayed:")
print((df['left'].map(status_mapping).value_counts(normalize=True) * 100).round(1).astype(str) + "%")

Number of People Who Left vs. Stayed:
left
Stayed    10000
Left       1991
Name: count, dtype: object

Percentage of People Who Left vs. Stayed:
left
Stayed    83.4%
Left      16.6%
Name: proportion, dtype: object
Now, examine variables that you're interested in, and create plots to visualize relationships between variables in the data.
# Create a plot as needed
sns.boxplot(x=df['satisfaction_level'])
plt.title('Satisfaction Level Distribution')
plt.xlabel('Satisfaction Level')
Text(0.5, 0, 'Satisfaction Level')
# Create a plot as needed
sns.boxplot(x=df['average_monthly_hours'])
plt.title('Average Monthly Hours Boxplot')
plt.xlabel('Average Monthly Hours')
Text(0.5, 0, 'Average Monthly Hours')
# Create a plot as needed
sns.boxplot(x=df['tenure'])
plt.title('Tenure Boxplot')
plt.xlabel('Tenure (years)')
Text(0.5, 0, 'Tenure (years)')
# Create a plot as needed
counts = df['left'].value_counts()
labels = ['Stayed', 'Left']
plt.pie(counts, labels=labels, autopct='%1.1f%%', startangle=90)
plt.title('Distribution of Left Employees')
plt.show()
fig, ax = plt.subplots(1, 2, figsize=(12,6))
sns.countplot(data=df, x='promotion_last_5years', hue='left', ax=ax[0])
ax[0].set_title('Left vs Promotion Last 5 Years')
ax[0].legend(title='Left', labels=['Stayed', 'Left'])
sns.boxplot(data=df, x='tenure', y='number_project', ax=ax[1])
ax[1].set_title('Tenure vs Number of Projects')
ax[1].set_xlabel('Tenure (years)')
ax[1].set_ylabel('Number of Projects')
plt.xticks(rotation=45)
plt.tight_layout()
# Get numbers of people who left vs. stayed
status_mapping = {
0:'Not promoted',
1:'Promoted'
}
print("\nNumber of People Promoted vs Not Promoted:")
print(df['promotion_last_5years'].map(status_mapping).value_counts().to_string())
# Get percentages of people promoted vs. not promoted
print("\nPercentage of People Promoted vs Not Promoted:")
print((df['promotion_last_5years'].map(status_mapping).value_counts(normalize=True) * 100).round(1).astype(str) + "%")

Number of People Promoted vs Not Promoted:
promotion_last_5years
Not promoted    11788
Promoted          203

Percentage of People Promoted vs Not Promoted:
promotion_last_5years
Not promoted    98.3%
Promoted         1.7%
Name: proportion, dtype: object
# Create a plot as needed
plt.figure(figsize=(20,12))
sns.histplot(data = df, x = 'department', hue = 'left', multiple = 'dodge', shrink = 0.8)
plt.xlabel('Department', fontsize=14)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.legend(title='Left', labels=['Stayed', 'Left'])
plt.ylabel('Number of Employees', fontsize=14)
plt.xticks(rotation=45)
plt.title('Distribution of Employees by Department', fontsize=20)
Text(0.5, 1.0, 'Distribution of Employees by Department')
# Create a plot as needed
df['department'].value_counts()
department
sales          3239
technical      2244
support        1821
IT              976
RandD           694
product_mng     686
marketing       673
accounting      621
hr              601
management      436
Name: count, dtype: int64
plt.figure(figsize=(20,12))
plt.xticks(fontsize=18)
plt.yticks(fontsize=16)
plt.xlabel('Satisfaction level', fontsize=16)
sns.histplot(x ='satisfaction_level', hue = 'salary', multiple='stack', data=df)
plt.title('Satisfaction Level by Salary', fontsize=20)
legend = plt.gca().get_legend()
legend.get_title().set_fontsize(18)
for text in legend.get_texts():
text.set_fontsize(16)
df['salary'].value_counts()
salary
low       5740
medium    5261
high       990
Name: count, dtype: int64
df_sat_cat = df.copy()
df_sat_cat['satisfaction_cat'] = pd.cut(
df['satisfaction_level'],
bins=[0, 0.4, 0.7, 1.0],
labels=['Low', 'Medium', 'High']
)
# - pd.cut() → This function bins the values into predefined ranges
# - bins=[0, 0.4, 0.7, 1.0] → Defines the intervals
# - 0 to 0.4 → Assigned to 'Low'.
# - 0.4 to 0.7 → Assigned to 'Medium'.
# - 0.7 to 1.0 → Assigned to 'High'.
satisfaction_counts = df_sat_cat['satisfaction_cat'].value_counts()
plt.figure(figsize=(8, 8))
plt.pie(satisfaction_counts, labels=satisfaction_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Satisfaction Level Categories')
plt.show()
fig, ax = plt.subplots(1, 2, figsize=(20, 8))
sns.countplot(data=df, x='salary', hue='left', ax=ax[0])
ax[0].set_title('Leaving Employees by Salary Level')
ax[0].legend(title='Left', labels=['Stayed', 'Left'])
ax[0].set_xlabel('Salary Level')
ax[0].set_ylabel('Number of Employees')
sns.histplot(data=df, x='satisfaction_level', hue='left', multiple='stack', ax=ax[1])
ax[1].set_title('Satisfaction Level vs Leaving Employees')
ax[1].legend(title='Left', labels=['Stayed', 'Left'])
ax[1].set_xlabel('Satisfaction Level')
ax[1].set_ylabel('Number of Employees')
Text(0, 0.5, 'Number of Employees')
sns.scatterplot(data=df, x='average_monthly_hours', y='satisfaction_level', hue='left', alpha=0.4)
plt.axvline(x=166.67, color='orange', label='166.67 hrs./mo.', ls='--')
plt.title('Average Monthly Hours vs Satisfaction Level')
Text(0.5, 1.0, 'Average Monthly Hours vs Satisfaction Level')
plt.figure(figsize=(16, 9))
sns.scatterplot(data=df, x='average_monthly_hours', y='last_evaluation', hue='left', alpha=0.4)
plt.axvline(x=166.67, color='orange', label='166.67 hrs./mo.', ls='--')
plt.legend(labels=['166.67 hrs./mo.', 'left', 'stayed'])
plt.title('Monthly hours by last evaluation score', fontsize='16')
Text(0.5, 1.0, 'Monthly hours by last evaluation score')
# Create a new dataframe with only numeric columns
df_corr = df.select_dtypes(include='number')
corr = df_corr.corr()
corr_target_sorted = corr.sort_values(by='left', ascending=False)
# Reorder rows and columns
ordered_corr = corr.loc[corr_target_sorted.index, corr_target_sorted.index]
# Plot with mask
mask = np.triu(np.ones_like(ordered_corr, dtype=bool))
plt.figure(figsize=(12, 10))
sns.heatmap(ordered_corr, mask=mask, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Sorted Correlation Matrix')
plt.show()
Strongest correlations with left:

| Feature | Corr with left | Interpretation |
|---|---|---|
| satisfaction_level | -0.35 | Strong negative correlation. Low satisfaction → more likely to leave. |
| tenure | +0.17 | Slight positive correlation. More years → slightly more likely to leave. |
| work_accident | -0.13 | Weak negative correlation. Having had an accident → slightly less likely to leave. |
| promotion_last_5years | -0.04 | Negligible impact. No real trend. |
What's weak or limited:

1. Low correlations across the board
- Apart from left (-0.35), all other variables correlate very weakly with satisfaction (close to 0).
- This means satisfaction is largely unexplained by these internal metrics, possibly due to:
  - Unmeasured variables (e.g., team culture, management style, stress, recognition)
  - Noise in how satisfaction was recorded or defined

2. Satisfaction is hard to explain with the current features
- Even with regression, only last_evaluation has a notable coefficient.
- The R² would probably be very low, meaning such a model would barely explain satisfaction.

In short, these features are good enough for predictive modeling of attrition (left), but weak for explaining satisfaction, which may depend on qualitative or external factors.
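The claim that satisfaction is hard to explain can be checked by fitting a regression on satisfaction and inspecting R². A minimal sketch on synthetic data constructed to mimic the weak-signal situation (the real check would regress `satisfaction_level` on the encoded features):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 2000

# Synthetic stand-in: satisfaction is mostly noise plus a weak effect of one feature
X = rng.random((n, 2))  # e.g., scaled last_evaluation and average_monthly_hours
satisfaction = 0.2 * X[:, 0] + rng.normal(0, 0.25, n)

model = LinearRegression().fit(X, satisfaction)
r2 = model.score(X, satisfaction)
print(f"R^2 = {r2:.3f}")  # low R^2 -> features explain little of the variation
```

A low R² here supports the point above: it is not evidence against the attrition model, only against using these features to explain satisfaction itself.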
Logistic Regression model assumptions
# Copy the dataframe
df_enc = df.copy()
# Encode the `salary` column as an ordinal numeric category
df_enc['salary'] = (
df_enc['salary'].astype('category')
.cat.set_categories(['low', 'medium', 'high'])
.cat.codes
)
# Dummy encode the `department` column
df_enc = pd.get_dummies(df_enc, drop_first=False)
# Display the new dataframe
df_enc.head()
| satisfaction_level | last_evaluation | number_project | average_monthly_hours | tenure | work_accident | left | promotion_last_5years | salary | department_IT | department_RandD | department_accounting | department_hr | department_management | department_marketing | department_product_mng | department_sales | department_support | department_technical | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | 1 | False | False | False | False | False | False | False | True | False | False |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | 1 | False | False | False | False | False | False | False | True | False | False |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
# Create a heatmap to visualize how correlated variables are
plt.figure(figsize=(8, 6))
sns.heatmap(df_enc[['satisfaction_level', 'last_evaluation', 'number_project', 'average_monthly_hours', 'tenure']]
.corr(), annot=True, cmap="crest")
plt.title('Heatmap of the dataset')
plt.show()
# Create a grouped bar plot to visualize the number of employees across departments, comparing those who left with those who stayed
pd.crosstab(df['department'], df['left']).plot(kind='bar', color=['m', 'r'])
plt.title('Counts of employees who left versus stayed across department')
plt.legend(title='Left', labels=['Stayed', 'Left'])
plt.ylabel('Employee count')
plt.xlabel('Department')
plt.show()
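The `lower_limit` and `upper_limit` used in the next cell were computed earlier in the notebook. If you need to recreate them, a standard 1.5 × IQR rule is one option; this is a sketch with illustrative values, not necessarily the exact bounds used earlier:

```python
import numpy as np

# 1.5 * IQR rule on a toy `tenure` sample (illustrative values only)
tenure = np.array([2, 3, 3, 4, 5, 6, 8, 10])
q1, q3 = np.percentile(tenure, [25, 75])
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
print(lower_limit, upper_limit)
```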
# Since logistic regression is quite sensitive to outliers, it would be a good idea at this stage to remove the outliers in the `tenure` column that were identified earlier.
df_log_reg = df_enc[(df_enc['tenure'] >= lower_limit) & (df_enc['tenure'] <= upper_limit)]
df_log_reg.head()
| satisfaction_level | last_evaluation | number_project | average_monthly_hours | tenure | work_accident | left | promotion_last_5years | salary | department_IT | department_RandD | department_accounting | department_hr | department_management | department_marketing | department_product_mng | department_sales | department_support | department_technical | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | 1 | False | False | False | False | False | False | False | True | False | False |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 5 | 0.41 | 0.50 | 2 | 153 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
y = df_log_reg['left']
X = df_log_reg.drop(columns=['left'])
X_train_log_reg, X_test_log_reg, y_train_log_reg, y_test_log_reg = train_test_split(X, y, test_size=0.2, random_state=42)
# The fitted classifier object (log_clf) has attributes like .coef_ and .intercept_, which store the learned coefficients and intercept.
log_clf = LogisticRegression(random_state=42, max_iter=500).fit(X_train_log_reg, y_train_log_reg)
# Predict the outcome (`left`) for the test features
y_pred = log_clf.predict(X_test_log_reg)
# Compute the Confusion Matrix
log_cm = confusion_matrix(y_test_log_reg, y_pred)
# Create the display object for the confusion matrix
log_disp = ConfusionMatrixDisplay(confusion_matrix = log_cm, display_labels = log_clf.classes_)
# Plot the confusion matrix
log_disp.plot(values_format='')
# Display plot
plt.show()
# Compute Classification Report for logistic regression model
target_names = ['Predicted would not leave', 'Predicted would leave']
print(classification_report(y_test_log_reg, y_pred, target_names=target_names))
precision recall f1-score support
Predicted would not leave 0.86 0.94 0.90 1846
Predicted would leave 0.49 0.26 0.34 388
accuracy 0.82 2234
macro avg 0.67 0.60 0.62 2234
weighted avg 0.79 0.82 0.80 2234
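As a reminder of how these report numbers relate to the confusion matrix, precision, recall, and F1 for the "left" class can be recomputed by hand. The counts below are hypothetical values chosen to be consistent with the report above, not cells read from the plot:

```python
# Hypothetical confusion-matrix counts for the positive ('left') class
tp, fp, fn = 100, 104, 288
precision = tp / (tp + fp)            # of all predicted leavers, fraction correct
recall = tp / (tp + fn)               # of all actual leavers, fraction found
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.49 0.26 0.34
```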
y = df_enc['left']
X = df_enc.drop('left', axis=1)
X_train_tree1, X_test_tree1, y_train_tree1, y_test_tree1 = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
tree = DecisionTreeClassifier(random_state=0)
cv_params = {'max_depth':[4, 6, 8, None],
'min_samples_leaf': [2, 5, 1],
'min_samples_split': [2, 4, 6]
}
scoring = {
'accuracy': 'accuracy',
'precision': 'precision',
'recall': 'recall',
'f1': 'f1',
'roc_auc': 'roc_auc'
}
# Instantiate GridSearch
tree1 = GridSearchCV(tree, cv_params, scoring=scoring, cv=4, refit='roc_auc')
tree1.fit(X_train_tree1, y_train_tree1)
GridSearchCV(cv=4, estimator=DecisionTreeClassifier(random_state=0),
             param_grid={'max_depth': [4, 6, 8, None],
                         'min_samples_leaf': [2, 5, 1],
                         'min_samples_split': [2, 4, 6]},
             refit='roc_auc',
             scoring={'accuracy': 'accuracy', 'f1': 'f1',
                      'precision': 'precision', 'recall': 'recall',
                      'roc_auc': 'roc_auc'})
DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=0)
y_pred_tree1 = tree1.best_estimator_.predict(X_test_tree1)
tree1.best_params_
{'max_depth': 4, 'min_samples_leaf': 5, 'min_samples_split': 2}
# Best AUC score
tree1.best_score_
np.float64(0.969819392792457)
def make_results(model_name: str, model_object_or_metadata, metric: str):
metric_dict = {
'roc_auc': 'mean_test_roc_auc',
'accuracy': 'mean_test_accuracy',
'precision': 'mean_test_precision',
'recall': 'mean_test_recall',
'f1': 'mean_test_f1'
}
# If it's a dict (metadata), grab 'cv_results'
if isinstance(model_object_or_metadata, dict):
cv_results = pd.DataFrame(model_object_or_metadata['cv_results'])
else:
cv_results = pd.DataFrame(model_object_or_metadata.cv_results_)
best_estimator_results = cv_results.iloc[
cv_results[metric_dict[metric]].idxmax(), :
]
# Safely extract metrics
auc = best_estimator_results.get('mean_test_roc_auc', None)
f1 = best_estimator_results.get('mean_test_f1', None)
recall = best_estimator_results.get('mean_test_recall', None)
precision = best_estimator_results.get('mean_test_precision', None)
accuracy = best_estimator_results.get('mean_test_accuracy', None)
table = pd.DataFrame({
'model': [model_name],
'precision': [precision],
'recall': [recall],
'F1': [f1],
'accuracy': [accuracy],
'roc_auc': [auc]
})
return table
tree1_cv_results = make_results('decision tree cv', tree1, 'roc_auc')
tree1_cv_results
| model | precision | recall | F1 | accuracy | roc_auc | |
|---|---|---|---|---|---|---|
| 0 | decision tree cv | 0.914552 | 0.916949 | 0.915707 | 0.971978 | 0.969819 |
All of these scores from the decision tree model are strong indicators of good model performance.
# Instantiate model
rf = RandomForestClassifier(random_state=0)
# Assign a dictionary of hyperparameters to search over
cv_params = {'max_depth': [3,5, None],
'max_features': [1.0],
'max_samples': [0.7, 1.0],
'min_samples_leaf': [1,2,3],
'min_samples_split': [2,3,4],
'n_estimators': [300, 500],
}
# Assign a dictionary of scoring metrics to capture
scoring = {
'accuracy': 'accuracy',
'precision': 'precision',
'recall': 'recall',
'f1': 'f1',
'roc_auc': 'roc_auc'
}
# Instantiate GridSearch
rf1 = GridSearchCV(rf, cv_params, scoring=scoring, cv=4, refit='roc_auc', n_jobs=-1)
# %%time
# To avoid refitting each time, write the fitted model to a pickle file and load it later
rf1.fit(X_train_tree1, y_train_tree1) # --> Wall time: ~10min
# Define a path to the folder where you want to save the model
path = '/home/jovyan/work/'
def write_pickle(base_path, model_object, save_as:str):
full_path = os.path.join(base_path, save_as + '.pickle')
print(f"Attempting to save pickle to: {full_path}") # Add this for debugging
with open(full_path, 'wb') as to_write:
pickle.dump(model_object, to_write)
def read_pickle(path, saved_model_name:str):
file_path = os.path.join(path, saved_model_name + '.pickle')
with open(file_path, 'rb') as to_read:
model = pickle.load(to_read)
return model
# Write pickle
# Skip if the file has already been written
write_pickle(path, rf1, 'hr_rf1')
# Read pickle
rf1 = read_pickle(path, 'hr_rf1')
# Check best AUC score on CV
rf1.best_score_
np.float64(0.9804250949807172)
# Check best params
rf1.best_params_
{'max_depth': 5,
'max_features': 1.0,
'max_samples': 0.7,
'min_samples_leaf': 1,
'min_samples_split': 4,
'n_estimators': 500}
# Get all CV scores
rf1_cv_results = make_results('random forest cv', rf1, 'roc_auc')
print(tree1_cv_results)
print(rf1_cv_results)
model precision recall F1 accuracy roc_auc
0 decision tree cv 0.914552 0.916949 0.915707 0.971978 0.969819
model precision recall F1 accuracy roc_auc
0 random forest cv 0.950023 0.915614 0.932467 0.977983 0.980425
def get_scores(model_name: str, model, X_test_data, y_test_data):
preds = model.predict(X_test_data) # predicted classes
if hasattr(model, "predict_proba"):
probs = model.predict_proba(X_test_data)[:, 1] # positive class probabilities
else:
probs = preds # fallback or raise error here
roc_auc = roc_auc_score(y_test_data, probs)
accuracy = accuracy_score(y_test_data, preds)
precision = precision_score(y_test_data, preds)
recall = recall_score(y_test_data, preds)
f1 = f1_score(y_test_data, preds)
return pd.DataFrame({
'model': [model_name],
'roc_auc': [roc_auc],
'accuracy': [accuracy],
'precision': [precision],
'recall': [recall],
'f1': [f1]
})
# Get predictions on test data
rf1_test_scores = get_scores('random forest1 test', rf1, X_test_tree1, y_test_tree1)
rf1_test_scores
| model | roc_auc | accuracy | precision | recall | f1 | |
|---|---|---|---|---|---|---|
| 0 | random forest1 test | 0.984643 | 0.980987 | 0.964211 | 0.919679 | 0.941418 |
# Generate array of values for confusion matrix
preds_rf1 = rf1.best_estimator_.predict(X_test_tree1)
cm_rf1 = confusion_matrix(y_test_tree1, preds_rf1, labels=rf1.classes_)
# Plot confusion matrix
disp_rf1 = ConfusionMatrixDisplay(confusion_matrix=cm_rf1,
display_labels=rf1.classes_)
disp_rf1.plot(values_format='')
plt.show()
# Get feature importances
feat_impt = rf1.best_estimator_.feature_importances_
# Get indices of top 10 features
ind = np.argpartition(rf1.best_estimator_.feature_importances_, -10)[-10:]
# Get column labels of top 10 features
feat = X.columns[ind]
# Filter `feat_impt` to consist of top 10 feature importances
feat_impt = feat_impt[ind]
y_df1 = pd.DataFrame({"feature":feat,"importance":feat_impt})
y_sort_df1 = y_df1.sort_values("importance")
# Plot the feature importances
fig = plt.figure()
ax1 = fig.add_subplot(111)
y_sort_df1.plot(kind='barh',ax=ax1,x="feature",y="importance")
ax1.set_title("Random Forest: Feature Importances for Employee Leaving", fontsize=12)
ax1.set_ylabel("feature")
ax1.set_xlabel("importance")
plt.show()
These high evaluation scores are unusual. There is a chance that some data leakage or misinterpretation is inflating the results.
It's likely that the company does not report satisfaction levels for all of its employees. The average_monthly_hours column may also be a source of badly sampled data: employees who have already decided to quit, or who have already been identified by management as people to be let go, may be working fewer hours.
It is reasonable to drop satisfaction_level and to create a new feature that roughly captures whether an employee is overworked. This new feature can be called overworked, and it will be a binary variable.
# Drop `satisfaction_level` and save resulting dataframe in new variable
df2 = df_enc.drop('satisfaction_level', axis=1)
# Create `overworked` column. For now, it's identical to average monthly hours.
df2['overworked'] = df2['average_monthly_hours']
# Inspect max and min average monthly hours values
print('Max hours:', df2['overworked'].max())
print('Min hours:', df2['overworked'].min())
Max hours: 310
Min hours: 96
166.67 is approximately the average number of monthly hours for someone who works 50 weeks per year, 5 days per week, 8 hours per day.
One reasonable definition of being overworked is working more than 175 hours per month on average.
To make the overworked column binary, you can reassign it using a boolean mask: `df2['overworked'] > 175` creates a series of booleans, True for every value > 175 and False for every value ≤ 175.
# Define `overworked` as working > 175 hrs/month
df2['overworked'] = (df2['overworked'] > 175).astype(int)
# Display first few rows of new column
df2['overworked'].head()
0    0
1    1
2    1
3    1
4    0
Name: overworked, dtype: int64
df2 = df2.drop('average_monthly_hours', axis=1)
# Isolate the outcome variable
y = df2['left']
# Select the features
X = df2.drop('left', axis=1)
# Create test data
X_train_tree2, X_test_tree2, y_train_tree2, y_test_tree2 = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
# Instantiate model
tree = DecisionTreeClassifier(random_state=0)
# Assign a dictionary of hyperparameters to search over
cv_params = {'max_depth':[4, 6, 8, None],
'min_samples_leaf': [2, 5, 1],
'min_samples_split': [2, 4, 6]
}
# Assign a dictionary of scoring metrics to capture
scoring = {
'accuracy': 'accuracy',
'precision': 'precision',
'recall': 'recall',
'f1': 'f1',
'roc_auc': 'roc_auc'
}
# Instantiate GridSearch
tree2 = GridSearchCV(tree, cv_params, scoring=scoring, cv=4, refit='roc_auc')
%%time
tree2.fit(X_train_tree2, y_train_tree2)
CPU times: total: 2.75 s Wall time: 2.75 s
GridSearchCV(cv=4, estimator=DecisionTreeClassifier(random_state=0),
             param_grid={'max_depth': [4, 6, 8, None],
                         'min_samples_leaf': [2, 5, 1],
                         'min_samples_split': [2, 4, 6]},
             refit='roc_auc',
             scoring={'accuracy': 'accuracy', 'f1': 'f1',
                      'precision': 'precision', 'recall': 'recall',
                      'roc_auc': 'roc_auc'})
DecisionTreeClassifier(max_depth=6, min_samples_leaf=2, min_samples_split=6,
                       random_state=0)
y_pred_tree2 = tree2.best_estimator_.predict(X_test_tree2)
# Check best params
tree2.best_params_
{'max_depth': 6, 'min_samples_leaf': 2, 'min_samples_split': 6}
# Check best AUC score on CV
tree2.best_score_
np.float64(0.9586752505340426)
# Get all CV scores
tree2_cv_results = make_results('decision tree2 cv', tree2, 'roc_auc')
print(tree1_cv_results)
print(tree2_cv_results)
model precision recall F1 accuracy roc_auc
0 decision tree cv 0.914552 0.916949 0.915707 0.971978 0.969819
model precision recall F1 accuracy roc_auc
0 decision tree2 cv 0.856693 0.903553 0.878882 0.958523 0.958675
# Instantiate model
rf = RandomForestClassifier(random_state=0)
# Assign a dictionary of hyperparameters to search over
cv_params = {'max_depth': [3,5, None],
'max_features': [1.0],
'max_samples': [0.7, 1.0],
'min_samples_leaf': [1,2,3],
'min_samples_split': [2,3,4],
'n_estimators': [300, 500],
}
# Assign a dictionary of scoring metrics to capture
scoring = {
'accuracy': 'accuracy',
'precision': 'precision',
'recall': 'recall',
'f1': 'f1',
'roc_auc': 'roc_auc'
}
# Instantiate GridSearch
rf2 = GridSearchCV(rf, cv_params, scoring=scoring, cv=4, refit='roc_auc')
# %%time
# Skip if the file has already been written
# rf2.fit(X_train_tree2, y_train_tree2) # --> Wall time: 7min 5s
# Write pickle
# Skip this step if you already have the model saved
# write_pickle(path, rf2, 'hr_rf2')
# Read in pickle
rf2 = read_pickle(path, 'hr_rf2')
# Check best params
rf2.best_params_
{'max_depth': 5,
'max_features': 1.0,
'max_samples': 0.7,
'min_samples_leaf': 2,
'min_samples_split': 2,
'n_estimators': 300}
# Check best AUC score on CV
rf2.best_score_
np.float64(0.9648100662833985)
# Get all CV scores
rf2_cv_results = make_results('random forest2 cv', rf2, 'roc_auc')
print(tree2_cv_results)
print(rf2_cv_results)
model precision recall F1 accuracy roc_auc
0 decision tree2 cv 0.856693 0.903553 0.878882 0.958523 0.958675
model precision recall F1 accuracy roc_auc
0 random forest2 cv 0.866758 0.878754 0.872407 0.957411 0.96481
# Get predictions on test data
rf2_test_scores = get_scores('random forest2 test', rf2, X_test_tree2, y_test_tree2)
rf2_test_scores
| model | roc_auc | accuracy | precision | recall | f1 | |
|---|---|---|---|---|---|---|
| 0 | random forest2 test | 0.968497 | 0.961641 | 0.870406 | 0.903614 | 0.8867 |
# Generate array of values for confusion matrix
preds_rf2 = rf2.best_estimator_.predict(X_test_tree2)
cm_rf2 = confusion_matrix(y_test_tree2, preds_rf2, labels=rf2.classes_)
# Plot confusion matrix
disp_rf2 = ConfusionMatrixDisplay(confusion_matrix=cm_rf2,
display_labels=rf2.classes_)
disp_rf2.plot(values_format='')
plt.show()
# Plot the tree
plt.figure(figsize=(85,20))
plot_tree(tree2.best_estimator_, max_depth=6, fontsize=14, feature_names=X.columns,
class_names={0:'stayed', 1:'left'}, filled=True)
plt.show()
Gini importance (also known as Mean Decrease in Impurity) is a metric used in decision trees and random forests to measure the importance of each feature in making predictions. It is based on the Gini impurity, which quantifies how mixed the classes are within a node.
How It Works:
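In short: each split is credited with the impurity reduction it achieves, weighted by the fraction of samples reaching the node, and these credits are summed per feature and normalized. The underlying Gini impurity itself is simple to compute; a minimal sketch (not scikit-learn's internal code):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum(p_k^2) over class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity([0, 0, 1, 1]))  # 0.5 for a perfectly mixed binary node
print(gini_impurity([1, 1, 1, 1]))  # 0.0 for a pure node
```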
tree2_importances = pd.DataFrame(tree2.best_estimator_.feature_importances_,
columns=['gini_importance'],
index=X.columns
)
tree2_importances = tree2_importances.sort_values(by='gini_importance', ascending=False)
# Only extract the features with importances > 0
tree2_importances = tree2_importances[tree2_importances['gini_importance'] != 0]
tree2_importances
| gini_importance | |
|---|---|
| last_evaluation | 0.343958 |
| number_project | 0.343385 |
| tenure | 0.215681 |
| overworked | 0.093498 |
| department_support | 0.001142 |
| salary | 0.000910 |
| department_sales | 0.000607 |
| department_technical | 0.000418 |
| work_accident | 0.000183 |
| department_IT | 0.000139 |
| department_marketing | 0.000078 |
sns.barplot(data=tree2_importances, x="gini_importance", y=tree2_importances.index, orient='h')
plt.title("Decision Tree: Feature Importances for Employee Leaving", fontsize=12)
plt.ylabel("Feature")
plt.xlabel("Importance")
plt.show()
The barplot above shows that in this decision tree model, last_evaluation, number_project, tenure, and overworked have the highest importance, in that order. These variables are most helpful in predicting the outcome variable, left.
Now, plotting the feature importances for the random forest model.
# Get feature importances
feat_impt = rf2.best_estimator_.feature_importances_
# Get indices of top 10 features
ind = np.argpartition(rf2.best_estimator_.feature_importances_, -10)[-10:]
# Get column labels of top 10 features
feat = X.columns[ind]
# Filter `feat_impt` to consist of top 10 feature importances
feat_impt = feat_impt[ind]
y_df2 = pd.DataFrame({"feature":feat,"importance":feat_impt})
y_sort_df2 = y_df2.sort_values("importance")
# Plot the feature importances
fig = plt.figure()
ax1 = fig.add_subplot(111)
y_sort_df2.plot(kind='barh',ax=ax1,x="feature",y="importance")
ax1.set_title("Random Forest: Feature Importances for Employee Leaving", fontsize=12)
ax1.set_ylabel("feature")
ax1.set_xlabel("importance")
plt.show()
y_sort_df2
| feature | importance | |
|---|---|---|
| 0 | work_accident | 0.000276 |
| 1 | department_IT | 0.000291 |
| 2 | department_technical | 0.000414 |
| 3 | department_support | 0.000578 |
| 4 | department_sales | 0.000612 |
| 5 | salary | 0.000644 |
| 6 | overworked | 0.080984 |
| 7 | tenure | 0.199109 |
| 8 | number_project | 0.356801 |
| 9 | last_evaluation | 0.359494 |
The plot above shows that in this random forest model, last_evaluation, number_project, tenure, and overworked have the highest importance, in that order. These variables are most helpful in predicting the outcome variable, left, and they are the same as the ones used by the decision tree model.
# True labels
y_true = [0, 0, 1, 1]
# Predicted probabilities
y_scores = [0.1, 0.4, 0.35, 0.8]
# Compute AUC
auc_score = roc_auc_score(y_true, y_scores)
print(f"AUC Score: {auc_score:.2f}")
AUC Score: 0.75
# Check best params
rf2.best_params_
{'max_depth': 5,
'max_features': 1.0,
'max_samples': 0.7,
'min_samples_leaf': 2,
'min_samples_split': 2,
'n_estimators': 300}
# Isolate the outcome variable (this round uses the full encoded dataframe, `df_enc`)
y = df_enc['left']
X = df_enc.drop('left', axis=1)
X_train_xgb1, X_test_xgb1, y_train_xgb1, y_test_xgb1 = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
# Fit the model
xgb_model = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)
cv_params_xgb = {
'max_depth': [3, 5, 7],
'learning_rate': [0.01, 0.1, 0.2],
'n_estimators': [100, 300, 500],
'subsample': [0.7, 0.8, 1.0], # Fraction of samples used for fitting the individual base learners
'colsample_bytree': [0.7, 0.8, 1.0], # Subsample ratio of columns when constructing each tree
'gamma': [0, 0.1, 0.2], # Minimum loss reduction required to make a further partition
'lambda': [1, 1.5, 2], # L2 regularization term on weights
'alpha': [0, 0.1, 0.5] # L1 regularization term on weights
}
# Assign a dictionary of scoring metrics to capture
scoring = {
'accuracy': 'accuracy',
'precision': 'precision',
'recall': 'recall',
'f1': 'f1',
'roc_auc': 'roc_auc'
}
# Set up the GridSearchCV
xgb_model1_fit = GridSearchCV(
estimator=xgb_model,
param_grid=cv_params_xgb,
scoring=scoring,
cv=4,
refit='roc_auc',
verbose=1, # optional, for more info
n_jobs=-1 # to leverage all CPU cores.
)
# %%time
# Skip if the file has already been written
# xgb_model1_fit.fit(X_train_xgb1, y_train_xgb1)
def write_xgb_json(base_path, model_object, save_as:str):
full_path = os.path.join(base_path, save_as + '.json')
print(f"Attempting to save XGBoost model to: {full_path}")
model_object.save_model(full_path)
# Skip this step if you already have the model saved
#write_xgb_json(path, xgb_model1_fit.best_estimator_, "hr_xgb_model1")
# This is only needed if you want to save the model as a JSON file
# Skip this step if you already have the metadata saved
best_score = xgb_model1_fit.best_score_
cv_results = xgb_model1_fit.cv_results_
scorer = xgb_model1_fit.scorer_
refit_time = xgb_model1_fit.refit_time_
metadata = {
'best_params': xgb_model1_fit.best_params_,
'best_score': xgb_model1_fit.best_score_,
'cv_results': xgb_model1_fit.cv_results_,
'scorer': xgb_model1_fit.scorer_,
'refit_time': xgb_model1_fit.refit_time_,
'n_splits': xgb_model1_fit.n_splits_
}
# Save the dictionary using pickle
#with open("xgb_model1_metadata.pkl", "wb") as f:
#pickle.dump(metadata, f)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[104], line 2
      1 # This is only needed if you want to save the model as a JSON file
----> 2 best_score = xgb_model1_fit.best_score_
      3 cv_results = xgb_model1_fit.cv_results_
      4 scorer = xgb_model1_fit.scorer_

AttributeError: 'GridSearchCV' object has no attribute 'best_score_'
(This error appears because the `fit` call above was skipped; the fitted model and its metadata are instead loaded from disk below.)
def read_xgb_json(path, saved_model_name:str):
full_path = os.path.join(path, saved_model_name + '.json')
print(f"Attempting to load XGBoost model from: {full_path}")
model = XGBClassifier()
model.load_model(full_path)
return model
# Load model
# # Run this cell only if you have already saved the model as a JSON file and the metadata as a pickle file
loaded_xgb_model1 = XGBClassifier()
loaded_xgb_model1.load_model("hr_xgb_model1.json")
# Load metadata
with open("xgb_model1_metadata.pkl", "rb") as f:
metadata = pickle.load(f)
# Use them
print(metadata['best_params'])
print(metadata['best_score'])
{'alpha': 0, 'colsample_bytree': 1.0, 'gamma': 0.1, 'lambda': 2, 'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100, 'subsample': 0.8}
0.9862188441742771
# Get all CV scores
xgb_model1_cv_results = make_results('XGBoost 1 CV (tuned)', metadata, 'roc_auc')
print(rf1_cv_results)
print(rf2_cv_results)
print(xgb_model1_cv_results)
model precision recall F1 accuracy roc_auc
0 random forest cv 0.950023 0.915614 0.932467 0.977983 0.980425
model precision recall F1 accuracy roc_auc
0 random forest2 cv 0.866758 0.878754 0.872407 0.957411 0.96481
model precision recall F1 accuracy roc_auc
0 XGBoost 1 CV (tuned) 0.975793 0.916281 0.945066 0.98232 0.986219
# Get predictions on test data
xgb_model1_test_scores = get_scores('XGB Model 1', loaded_xgb_model1, X_test_xgb1, y_test_xgb1)
xgb_model1_test_scores
| model | roc_auc | accuracy | precision | recall | f1 | |
|---|---|---|---|---|---|---|
| 0 | XGB Model 1 | 0.987859 | 0.982655 | 0.978541 | 0.915663 | 0.946058 |
preds_xgb1 = loaded_xgb_model1.predict(X_test_xgb1)
cm_xgb1 = confusion_matrix(y_test_xgb1, preds_xgb1, labels=loaded_xgb_model1.classes_)
disp_xgb1 = ConfusionMatrixDisplay(confusion_matrix=cm_xgb1,
display_labels=loaded_xgb_model1.classes_)
disp_xgb1.plot(values_format='')
plt.title('Confusion Matrix - XGBoost Model 1 (tuned)')
plt.show()
importance_xgb1 = loaded_xgb_model1.get_booster().get_score(importance_type='gain')
importance_df1 = pd.DataFrame({
'feature': list(importance_xgb1.keys()),
'importance': list(importance_xgb1.values())
}).sort_values(by='importance', ascending=False)
print(importance_df1)
                   feature  importance
0       satisfaction_level   45.783989
4                   tenure   36.121506
2           number_project   16.978743
1          last_evaluation   14.657480
3    average_monthly_hours    6.443074
5            work_accident    5.077404
6                   salary    3.423650
11  department_product_mng    2.317812
13      department_support    2.220171
9    department_accounting    1.897828
14    department_technical    1.864844
7            department_IT    1.660149
12        department_sales    1.520898
8         department_RandD    1.387220
10    department_marketing    0.161139
plt.figure(figsize=(12, 6))
sns.barplot(x='importance', y='feature', data=importance_df1, hue='feature', palette='viridis', legend=False)  # assign hue to avoid the seaborn palette deprecation warning
plt.title('XGBoost Feature Importances (Gain)')
plt.xlabel('Importance (Gain)')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()
y = df2['left']
# Select the features
X = df2.drop('left', axis=1)
X_train_xgb2, X_test_xgb2, y_train_xgb2, y_test_xgb2 = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
# Fit the model
xgb_model = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)
# Assign a dictionary of scoring metrics to capture
scoring = {
'accuracy': 'accuracy',
'precision': 'precision',
'recall': 'recall',
'f1': 'f1',
'roc_auc': 'roc_auc'
}
# Set up the GridSearchCV
xgb_model2_fit = GridSearchCV(
estimator=xgb_model,
param_grid=cv_params_xgb,
scoring=scoring,
cv=4,
refit='roc_auc',
verbose=2, # optional, for more info
n_jobs=-1 # to leverage all CPU cores.
)
# %%time
# Skip if the file has already been written
# xgb_model2_fit.fit(X_train_xgb2, y_train_xgb2)
# Skip this step if you already have the model saved
#write_xgb_json(path, xgb_model2_fit.best_estimator_, "hr_xgb_model2")
# Skip this step if you already have the metadata saved
best_params = xgb_model2_fit.best_params_
best_score = xgb_model2_fit.best_score_
cv_results = xgb_model2_fit.cv_results_
scorer = xgb_model2_fit.scorer_
refit_time = xgb_model2_fit.refit_time_
metadata = {
'best_params': xgb_model2_fit.best_params_,
'best_score': xgb_model2_fit.best_score_,
'cv_results': xgb_model2_fit.cv_results_,
'scorer': xgb_model2_fit.scorer_,
'refit_time': xgb_model2_fit.refit_time_,
'n_splits': xgb_model2_fit.n_splits_
}
# Save the dictionary using pickle
#with open("xgb_model2_metadata.pkl", "wb") as f:
#pickle.dump(metadata, f)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[117], line 1
----> 1 best_params = xgb_model2_fit.best_params_
      2 best_score = xgb_model2_fit.best_score_
      3 cv_results = xgb_model2_fit.cv_results_

AttributeError: 'GridSearchCV' object has no attribute 'best_params_'
(This error appears because the `fit` call above was skipped; the fitted model and its metadata are instead loaded from disk below.)
# Load model
# # Run this cell only if you have already saved the model as a JSON file and the metadata as a pickle file
loaded_xgb_model2 = XGBClassifier()
loaded_xgb_model2.load_model("hr_xgb_model2.json")
# Load metadata
with open("xgb_model2_metadata.pkl", "rb") as f:
metadata = pickle.load(f)
# Use them
print(metadata['best_params'])
print(metadata['best_score'])
{'alpha': 0, 'colsample_bytree': 1.0, 'gamma': 0.2, 'lambda': 1, 'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 500, 'subsample': 1.0}
0.9734895786440338
# Get all CV scores
xgb_model2_cv_results = make_results('XGBoost 2 CV (tuned)', metadata, 'roc_auc')
print(rf1_cv_results)
print(rf2_cv_results)
print(xgb_model1_cv_results)
print(xgb_model2_cv_results)
model precision recall F1 accuracy roc_auc
0 random forest cv 0.950023 0.915614 0.932467 0.977983 0.980425
model precision recall F1 accuracy roc_auc
0 random forest2 cv 0.866758 0.878754 0.872407 0.957411 0.96481
model precision recall F1 accuracy roc_auc
0 XGBoost 1 CV (tuned) 0.975793 0.916281 0.945066 0.98232 0.986219
model precision recall F1 accuracy roc_auc
0 XGBoost 2 CV (tuned) 0.909276 0.892835 0.900966 0.967419 0.97349
# Get predictions on test data
xgb_model2_test_scores = get_scores('xgb model2', loaded_xgb_model2, X_test_xgb2, y_test_xgb2)
xgb_model2_test_scores
| model | roc_auc | accuracy | precision | recall | f1 | |
|---|---|---|---|---|---|---|
| 0 | xgb model2 | 0.97502 | 0.965644 | 0.897384 | 0.895582 | 0.896482 |
preds_xgb2 = loaded_xgb_model2.predict(X_test_xgb2)
cm_xgb2 = confusion_matrix(y_test_xgb2, preds_xgb2, labels=loaded_xgb_model2.classes_)
disp_xgb2 = ConfusionMatrixDisplay(confusion_matrix=cm_xgb2,
display_labels=loaded_xgb_model2.classes_)
disp_xgb2.plot(values_format='')
plt.title('Confusion Matrix - XGBoost 2 (Dataset 2 - "overworked")')
plt.show()
importance_xgb2 = loaded_xgb_model2.get_booster().get_score(importance_type='gain')
importance_df2 = pd.DataFrame({
'feature': list(importance_xgb2.keys()),
'importance': list(importance_xgb2.values())
}).sort_values(by='importance', ascending=False)
print(importance_df2)
                   feature  importance
1           number_project   58.216866
2                   tenure   49.423893
0          last_evaluation   31.929428
13              overworked   30.090992
3            work_accident    8.550909
5                   salary    5.534005
4    promotion_last_5years    3.924562
11      department_support    1.628597
7         department_RandD    0.944614
10        department_sales    0.931679
12    department_technical    0.923531
6            department_IT    0.820470
9   department_product_mng    0.538557
8     department_marketing    0.499823
plt.figure(figsize=(12, 6))
sns.barplot(x='importance', y='feature', data=importance_df2, hue='feature', palette='viridis', legend=False)  # assign hue to avoid the seaborn palette deprecation warning
plt.title('XGBoost Feature Importances (Gain)')
plt.xlabel('Importance (Gain)')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()
# Create a 2x2 subplot grid
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
# Plot for Random Forest 1
y_sort_df1.plot(
kind='barh',
x='feature',
y='importance',
ax=axes[0, 0],
color='steelblue',
legend=False
)
axes[0, 0].set_title("Random Forest 1: Feature Importances")
axes[0, 0].set_xlabel("Importance")
axes[0, 0].set_ylabel("Feature")
# Plot for Random Forest 2
y_sort_df2.plot(
kind='barh',
x='feature',
y='importance',
ax=axes[0, 1],
color='seagreen',
legend=False
)
axes[0, 1].set_title("Random Forest 2: Feature Importances")
axes[0, 1].set_xlabel("Importance")
axes[0, 1].set_ylabel("Feature")
# Plot for XGBoost 1
sns.barplot(
x='importance',
y='feature',
data=importance_df1,
hue='feature',  # assign hue to avoid the seaborn palette deprecation warning
palette='viridis',
legend=False,
ax=axes[1, 0]
)
axes[1, 0].set_title("XGBoost 1: Feature Importances (Gain)")
axes[1, 0].set_xlabel("Importance (Gain)")
axes[1, 0].set_ylabel("Feature")
# Plot for XGBoost 2
sns.barplot(
x='importance',
y='feature',
data=importance_df2,
hue='feature',  # assign hue to avoid the seaborn palette deprecation warning
palette='viridis',
legend=False,
ax=axes[1, 1]
)
axes[1, 1].set_title("XGBoost 2: Feature Importances (Gain)")
axes[1, 1].set_xlabel("Importance (Gain)")
axes[1, 1].set_ylabel("Feature")
plt.tight_layout()
plt.show()
classes_rf1 = rf1.classes_
classes_rf2 = rf2.classes_
classes_xgb1 = np.unique(y_train_xgb1)
classes_xgb2 = np.unique(y_train_xgb2)
# --- Plotting the Confusion Matrices in a 2x2 Subplot Grid ---
fig, axes = plt.subplots(2, 2, figsize=(14, 12)) # Adjust figsize for better visualization
# Plot for Random Forest 1
disp_rf1 = ConfusionMatrixDisplay(confusion_matrix=cm_rf1, display_labels=classes_rf1)
disp_rf1.plot(ax=axes[0, 0], values_format='') # values_format='' to show raw numbers
axes[0, 0].set_title('Confusion Matrix: Random Forest 1 (Dataset 1)')
axes[0, 0].set_xlabel('Predicted label')
axes[0, 0].set_ylabel('True label')
# Plot for Random Forest 2
disp_rf2 = ConfusionMatrixDisplay(confusion_matrix=cm_rf2, display_labels=classes_rf2)
disp_rf2.plot(ax=axes[0, 1], values_format='')
axes[0, 1].set_title('Confusion Matrix: Random Forest 2 (Dataset 2 - "overworked")')
axes[0, 1].set_xlabel('Predicted label')
axes[0, 1].set_ylabel('True label')
# Plot for XGBoost Model 1
disp_xgb1 = ConfusionMatrixDisplay(confusion_matrix=cm_xgb1, display_labels=classes_xgb1)
disp_xgb1.plot(ax=axes[1, 0], values_format='')
axes[1, 0].set_title('Confusion Matrix: XGBoost 1 (Dataset 1)')
axes[1, 0].set_xlabel('Predicted label')
axes[1, 0].set_ylabel('True label')
# Plot for XGBoost Model 2
disp_xgb2 = ConfusionMatrixDisplay(confusion_matrix=cm_xgb2, display_labels=classes_xgb2)
disp_xgb2.plot(ax=axes[1, 1], values_format='')
axes[1, 1].set_title('Confusion Matrix: XGBoost 2 (Dataset 2 - "overworked")')
axes[1, 1].set_xlabel('Predicted label')
axes[1, 1].set_ylabel('True label')
plt.tight_layout() # Adjusts subplot params for a tight layout to prevent overlap
plt.show()
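The `cm_*` matrices plotted above were computed in an earlier cell. As a minimal, self-contained sketch of how each one is produced (the labels here are stand-ins for illustration, not the notebook's actual test-set predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Stand-in labels for illustration only -- the notebook's real matrices
# come from each model's predictions on the held-out test set.
y_test = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 1, 0, 0])

cm = confusion_matrix(y_test, y_pred)
print(cm.tolist())  # -> [[2, 1], [1, 2]]  (rows = true class, cols = predicted)
```

The same `confusion_matrix(y_test, y_pred)` call, applied to each model's predictions, yields the four matrices displayed in the grid.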
The configurations confirm that the search strategy for hyperparameters was the same for paired models (tree1/tree2, rf1/rf2, xgb_model1/xgb_model2). Therefore, the stark differences in feature importances observed (especially the presence/absence of overworked and the varying importance of satisfaction_level, average_monthly_hours, etc.) are due to the models being fitted on datasets with different feature sets.
rf1 and xgb_model1 were trained on a dataset without the overworked feature but including satisfaction_level and average_monthly_hours. rf2 and xgb_model2 were trained on a dataset with the overworked feature, which then took on some of the predictive importance previously held by other features.
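For context on the feature sets described above, the engineered `overworked` flag can be sketched as a simple threshold on `average_monthly_hours`. The 175-hour cutoff below is an assumption chosen for illustration, not necessarily the threshold used in this notebook:

```python
import pandas as pd

# Hypothetical 175-hour cutoff; the notebook's actual threshold may differ.
demo = pd.DataFrame({'average_monthly_hours': [140, 160, 200, 260]})
demo['overworked'] = (demo['average_monthly_hours'] > 175).astype(int)
print(demo['overworked'].tolist())  # -> [0, 0, 1, 1]
```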
features_for_clustering_df2 = [
'last_evaluation', # Performance metric
'number_project', # Workload indicator
'tenure', # Experience/loyalty indicator
'work_accident', # Binary indicator
'promotion_last_5years', # Binary indicator
'salary', # Ordinally encoded salary
'overworked' # Your engineered workload indicator
]
X_cluster = df2[features_for_clustering_df2].copy()
# Example: Assuming X_cluster contains your selected features for clustering
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_cluster)
inertia = []
K_range = range(1, 11) # Example range
for k in K_range:
kmeans = KMeans(n_clusters=k, random_state=42, n_init='auto') # n_init='auto' or 10
kmeans.fit(X_scaled)
inertia.append(kmeans.inertia_)
plt.figure(figsize=(8, 6))
plt.plot(K_range, inertia, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia (WCSS)')
plt.title('Elbow Method for Optimal K')
plt.show()
silhouette_avg_scores = []
K_range = range(2, 11) # Silhouette score requires at least 2 clusters
for k in K_range:
kmeans = KMeans(n_clusters=k, random_state=42, n_init='auto')
cluster_labels = kmeans.fit_predict(X_scaled)
silhouette_avg = silhouette_score(X_scaled, cluster_labels)
silhouette_avg_scores.append(silhouette_avg)
print(f"For K={k}, the average silhouette_score is : {silhouette_avg:.4f}")
plt.figure(figsize=(8, 6))
plt.plot(K_range, silhouette_avg_scores, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Average Silhouette Score')
plt.title('Silhouette Score for Optimal K')
plt.show()
For K=2, the average silhouette_score is : 0.6188
For K=3, the average silhouette_score is : 0.2251
For K=4, the average silhouette_score is : 0.2124
For K=5, the average silhouette_score is : 0.1969
For K=6, the average silhouette_score is : 0.2326
For K=7, the average silhouette_score is : 0.2186
For K=8, the average silhouette_score is : 0.2235
For K=9, the average silhouette_score is : 0.2218
For K=10, the average silhouette_score is : 0.2498
optimal_k = 3 # Replace with your chosen K
kmeans_final = KMeans(n_clusters=optimal_k, random_state=42, n_init='auto')
df2['cluster'] = kmeans_final.fit_predict(X_scaled) # Add cluster labels back to your original (or relevant) dataframe
centroids_scaled = kmeans_final.cluster_centers_
centroids_original_scale = scaler.inverse_transform(centroids_scaled)  # undo standardization: back to original feature units
centroid_df = pd.DataFrame(centroids_original_scale, columns=X_cluster.columns)
print(centroid_df)
   last_evaluation  number_project    tenure  work_accident  promotion_last_5years    salary    overworked
0         0.707438        3.798030  3.940887       0.236453           1.000000e+00  1.029557  6.305419e-01
1         0.746450        4.033307  3.426592       0.154676          -1.179612e-16  0.602451  9.994671e-01
2         0.664942        3.399113  3.229332       0.149696          -1.006140e-16  0.586175  9.992007e-15
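The `inverse_transform` call above simply undoes the standardization so the centroids are readable in original feature units. A quick round-trip check illustrates that the transform is exactly reversible:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Scale, then invert: inverse_transform recovers the original values.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(np.allclose(scaler.inverse_transform(X_scaled), X))  # -> True
```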
cluster_profile = df2.groupby('cluster')[[
'last_evaluation',
'number_project',
'tenure',
'salary', # Ordinally encoded salary
'work_accident',
'promotion_last_5years',
'overworked'
]].mean()
print(cluster_profile)
last_evaluation number_project tenure salary work_accident \
cluster
0 0.707438 3.798030 3.940887 1.029557 0.236453
1 0.746450 4.033307 3.426592 0.602451 0.154676
2 0.664942 3.399113 3.229332 0.586175 0.149696
promotion_last_5years overworked
cluster
0 1.0 0.630542
1 0.0 0.999467
2 0.0 0.000000
churn_rate_per_cluster = df2.groupby('cluster')['left'].mean()
print(churn_rate_per_cluster)
cluster
0    0.039409
1    0.146416
2    0.206446
Name: left, dtype: float64
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.boxplot(x='cluster', y='number_project', data=df2)
plt.title('Number of Projects by Cluster (using df2)')
plt.subplot(1, 2, 2)
sns.boxplot(x='cluster', y='tenure', data=df2)
plt.title('Tenure by Cluster (using df2)')
plt.tight_layout()
plt.show()
# Now df2 has the 'cluster' column
# Calculate the mean of 'overworked' for each cluster (proportion of overworked employees)
overworked_proportion_per_cluster = df2.groupby('cluster')['overworked'].mean()
plt.figure(figsize=(8, 6))
overworked_proportion_per_cluster.plot(kind='bar')
plt.title('Proportion of "Overworked" Employees by Cluster (using df2)')
plt.xlabel('Cluster')
plt.ylabel('Proportion Overworked')
plt.xticks(rotation=0)
plt.show()
churn_rate_per_cluster.plot(kind='bar')
plt.title('Churn Rate by Cluster')
plt.ylabel('Proportion of Employees Who Left')
plt.show()
Cluster 0: "The Recognized & Retained Core"
Profile: These are employees with moderate to high tenure and medium salaries who have all been promoted. They handle a moderate number of projects, have moderate-to-high evaluations, and a fair portion are overworked. Their work accident rate is the highest.
Churn: Extremely low (3.9%).
Insight for HR: Promotions are a powerful retention tool. This group, despite some being overworked, stays. The company is successfully retaining its promoted talent. The higher work accident rate warrants investigation for this group – are they in roles with higher risk, or is there a correlation with their work patterns?
Cluster 1: "The Overworked High Contributors (At Risk)"
Profile: These employees have high evaluations and manage a moderate-high number of projects. They have moderate tenure and low-to-medium salaries. Critically, none have been promoted, and virtually all are overworked.
Churn: Moderate (14.6%).
Insight for HR: This is a key group to focus on for retention. They are performing well but are likely feeling the strain of being overworked without the recognition of promotion or higher pay. They are a flight risk due to potential burnout or feeling undervalued.
HR Actions: Review workload distribution, explore non-promotional recognition, ensure fair compensation for their contribution level, and identify pathways for career advancement for these high contributors.
Cluster 2: "The Underutilized & Disengaged (Highest Churn)"
Profile: This group has the lowest average tenure, project count, and evaluation scores among the three clusters. They have low-to-medium salaries, no promotions, and importantly, they are not overworked (working normal or fewer hours).
Churn: Highest (20.6%).
Insight for HR: This is a very interesting finding. The highest churn comes from the group that isn't overworked, which suggests that factors other than workload are the primary drivers of their departure. Possible reasons include disengagement, a lack of growth or advancement opportunities, or feeling underutilized given their lower project counts and evaluation scores.
Summary of K-means Insights:
K-means has successfully identified three distinct employee segments with significantly different characteristics and churn rates.
Promotions are strongly linked to retention (Cluster 0).
A segment of high-performing, overworked, but unpromoted employees exists with a moderate churn risk (Cluster 1).
The highest churn comes from a segment that is not overworked but has lower evaluations, project counts, and no promotions, suggesting disengagement or lack of growth opportunities (Cluster 2).
Revised Summary of Model Results
Several models were evaluated to predict employee churn, including Logistic Regression, Decision Trees, Random Forests, and XGBoost. These models were tested on two primary feature sets: one incorporating original employee metrics like satisfaction_level and average_monthly_hours (Dataset 1), and another that excluded these, instead using an engineered overworked feature (Dataset 2).
Logistic Regression
The Logistic Regression model, applied (likely after addressing tenure outliers), served as a baseline. It achieved an accuracy of around 82% but showed a limited ability to correctly identify employees who would leave, with a recall of only 0.27 for the "left" class. This means the model predicts "stayed" far more often than "left," producing a large number of false negatives. For a business looking to proactively retain employees, this model would miss a substantial portion of those at risk of leaving.
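One common remedy for low recall on the minority class is class weighting. The sketch below uses synthetic imbalanced data (not the HR dataset) to show the idea; `class_weight='balanced'` typically trades some precision for higher recall on the rare class:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the churn dataset (~17% "left").
X, y = make_classification(n_samples=2000, weights=[0.83], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=1000, class_weight='balanced').fit(X_tr, y_tr)

# Recall on the minority class usually improves with balanced weights.
print(f"plain:    {recall_score(y_te, plain.predict(X_te)):.2f}")
print(f"balanced: {recall_score(y_te, balanced.predict(X_te)):.2f}")
```

Whether the precision/recall trade-off is worthwhile depends on the relative cost to HR of missing an at-risk employee versus intervening unnecessarily.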
Tree-based Machine Learning
Performance on Dataset 1 (with satisfaction_level, average_monthly_hours): Models in this group generally showed very high predictive performance, likely influenced significantly by satisfaction_level.
Performance on Dataset 2 (with engineered overworked feature; satisfaction_level & average_monthly_hours removed): These models aimed to predict churn without relying on potentially problematic leading indicators like satisfaction_level.
Feature Importance Comparison
When satisfaction_level is included (Dataset 1): It is by far the most dominant predictor.
When satisfaction_level is excluded and overworked is introduced (Dataset 2): number_project, tenure, last_evaluation, and the engineered overworked feature become the key drivers. This suggests that workload indicators (number_project, overworked) and employee experience/evaluation (tenure, last_evaluation) are critical factors in predicting churn once direct satisfaction measures are removed.
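Because impurity-based importances can shift when the feature set changes, permutation importance offers a model-agnostic cross-check. A minimal sketch on synthetic data (not the notebook's `rf1`/`rf2`; the same call would apply to a fitted model and its test set):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data for illustration; substitute the fitted model and
# held-out X/y from the notebook to cross-check the plots above.
X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(rf, X, y, n_repeats=5, random_state=0)
print(result.importances_mean.round(3))  # one mean importance per feature
```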
The predictive models and feature importance analyses confirm that employee workload and career-related factors are significant drivers of attrition at Salifort Motors. The models utilizing an engineered overworked feature, alongside variables like number_project, tenure, and last_evaluation, were particularly insightful for identifying at-risk employees without relying on potentially lagging indicators like satisfaction_level.
Furthermore, K-means clustering revealed distinct employee segments: notably, a highly overworked group of contributors with moderate churn, and a group that isn't overworked but exhibits the highest churn, suggesting issues related to engagement or growth opportunities. A third, low-churn group was characterized by having received promotions. These findings underscore that "overwork," while a key factor, interacts with recognition and engagement to influence an employee's decision to leave.
To retain employees, the following recommendations, based on the data, could be presented to stakeholders:
- Cap or monitor the number_project that employees are assigned, as this was a top predictor of churn risk, particularly for the "Overworked High Contributors" segment.
- Pay close attention to tenure, which is a strong predictor across different employee segments.
- Ensure that employees working excessive hours (flagged by the engineered overworked feature or high average_monthly_hours) are adequately rewarded and recognized, or adjust workload expectations to prevent burnout, particularly for the "Overworked High Contributors."
- Make sure last_evaluation scores are part of a fair and constructive feedback process. Critically, high evaluation scores should not be implicitly tied only to employees working excessive hours; reward contributions proportionately.

Next Steps
- Investigate last_evaluation further: given its consistent importance, it's prudent to explore the timing and nature of these evaluations. Determine whether there is a risk of data leakage (e.g., evaluations occurring after an employee's disengagement has already begun). Consider building models with and without this feature to understand its true proactive predictive power.
- Explore whether churn in the highest-risk segments is driven primarily by number_project or other factors.

Congratulations! You've completed this lab. However, you may not notice a green check mark next to this item on Coursera's platform. Please continue your progress regardless of the check mark. Just click on the "save" icon at the top of this notebook to ensure your work has been logged.